A trainable spoken language understanding system for visual object selection
Abstract
We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a table-top mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually-grounded vocabulary and grammar. Once the system is trained, a person can interact with it by verbally describing objects placed in front of it. The system recognizes and robustly parses the speech and points, in real time, to the object that best fits the visual semantics of the spoken description.
Similar resources
Grounding Natural Spoken Language Semantics in Visual Perception and Motor Control
A characteristic shared by most approaches to natural language understanding and generation is the use of symbolic representations of word and sentence meanings. Frames and semantic nets are two popular current approaches. Symbolic methods alone are inadequate for applications such as conversational robotics that require natural language semantics to be linked to perception and motor control. T...
A Trainable Visually-grounded Spoken Language Generation System
A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a ‘show-and-tell’ procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, word classes, and individual words. Using ...
Evaluating a Trainable Sentence Planner for a Spoken Dialogue System
Techniques for automatically training modules of a natural language generator have recently been proposed, but a fundamental concern is whether the quality of utterances produced with trainable components can compete with hand-crafted template-based or rule-based approaches. In this paper we experimentally evaluate a trainable sentence planner for a spoken dialogue system by eliciting subjective...
Visual Context Driven Semantic Priming of Speech Recognition and Understanding
Fuse is a spoken language understanding system that integrates visual context into early stages of speech recognition. Given a visual scene and a spoken description, the system finds the object in the scene that best fits the meaning of the description. To solve this task, Fuse performs speech recognition and visually-grounded language understanding. Rather than treat these two problems separat...
Towards situated speech understanding: visual context priming of language models
Fuse is a situated spoken language understanding system that uses visual context to steer the interpretation of speech. Given a visual scene and a spoken description, the system finds the object in the scene that best fits the meaning of the description. To solve this task, Fuse performs speech recognition and visually-grounded language understanding. Rather than treat these two problems separa...